Online learning method in a decision system

ABSTRACT

A learning model is initiated during start-up learning to activate operation of a decision system. During operation of the decision system, data is qualified for use in online learning. Online learning allows a system to adapt or learn application dependent parameters to optimize or maintain its performance during normal operation. Methods for qualifying data for use in online learning include thresholding of features, restriction of score space for qualified objects, and using a different source of information than is used in the decision process. Clustering methods are used to improve the quality of the learning model. A metric for learning maturity is derived by using the cumulative distribution function to compare two distributions and produce a measure of similarity.

TECHNICAL FIELD

[0001] This invention relates to online learning in a decision system.

BACKGROUND OF THE INVENTION

[0002] Many decision systems are applied in a variety of dynamic situations where a system must adapt or learn its internal structure or parameters for each situation in order to optimize or maintain its performance. For example, an image-based decision system that inspects semiconductor wafers for defects or measures system alignment and registration precision can expect the image characteristics to change as new layers are added to the wafer during various processing stages. Other dynamic applications include live cell analysis that tracks a group of cells over time or patient specific medical image analysis (such as MRI, CT, X-ray) that needs to account for the differences between imaging system setups or patients' characteristics and response to the imaging system. There are also many non-image based dynamic decision system applications, including data mining and decision support in security and financial applications, speech recognition, mobile robots, medical diagnosis and treatment support, etc.

[0003] Ideally, a learning process would take place in a supervised learning fashion wherein domain experts provide the truth labels associated with input data (labeled data). [Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Academic Press, 1999, pp 3-7.] This results in learning data for which the desired outcome has been provided and the system learns to match particular distributions of inputs to the associated outputs (desired response). In practice, however, such truth labeling presents significant problems in time, expense, and availability of data that often preclude its use. Furthermore, explicit learning after a system is placed into production interrupts normal system operation, which could significantly impact the productivity of the system.

[0004] It is desirable for the system to be able to learn online while performing productive work without taking time away from production for learning and requiring minimal user intervention. To do this, the system must automatically learn from unlabeled data to improve the system's performance on future data. This is related to but different from the prior art approach known as unsupervised learning [Sergios Theodoridis and Konstantinos Koutroumbas, Pattern Recognition, Academic Press, 1999, pp 351-383].

[0005] I. Learning Environment

[0006] Ideally, a decision system that needs to adapt to changed application characteristics would have an opportunity to learn about the application data in a highly supervised environment. A domain expert would inspect a large number of subjects and identify the ones it should alarm on and the ones it should not and provide the information to the decision system. In practice, however, there are reasons why this may not be practical or possible. Data collection can be time-intensive, especially when the system is continually being exposed to new conditions. Expert truth labelers are sometimes not available in a timely fashion. In addition, for assembly-line processes, such as manufactured part defect inspection, it may simply be impossible to take the system offline for sufficient time to enable the domain expert to label a large volume of data and present it to the system for learning. To deal with this problem, the invention separates learning into two phases, startup learning 102 and online learning 106 shown in FIG. 1. The startup learning approach assumes a very small amount of labeled data 100 and so imposes strong constraints on the shape of the data distribution models which can be learned so as to minimize the variance that results from limited sample size. Online learning 106 performs further learning by automatically acquiring large volumes of learning data, which is not labeled data. A decision process 104 produces the decision output.

[0007] Feature Distribution Modeling

[0008] One of the prior art concepts of decision or classification is the representation of the data being classified, such as images or sampled sounds or more abstract objects, in terms of a set of features each of which is represented as discrete or continuous values. [L. Breiman, J. Friedman, R. Olshen, C. Stone, “Classification and Regression Trees”, CRC Press LLC, 1998, pp 1-17] For example, a mechanism for inspecting people for potential health problems might use as its features cholesterol level, blood pressure, and resting pulse rate.

[0009] For a particular population, the set of samples has a particular distribution across the feature space. FIG. 2 shows an example blood pressure feature probability density distribution for healthy 200 and unhealthy 202 patients.

[0010] A higher-level feature (or combination of features) that was specifically designed to be an indicator of a healthy patient (i.e. any condition on which we wish to alarm) is derived. This could be accomplished by incorporating a large variety of lower-level features to arrive at some output for which increasing value indicates higher probability. To identify the percent of the population that were most likely to be healthy based on the blood pressure feature, we would use the cumulative distribution function (CDF). The CDF of a one-dimensional probability density distribution f(x′) is described as: $C_f(x) = \int_{-\infty}^{x} f(x')\,dx'$

[0011] In this example, x is the blood pressure value and the value of $C_f$ for the healthy population distribution 200 is the percent of the population likely to be healthy.

[0012] To do automated decision making, first measure (or model) the probability density distributions for the populations of interest. Classification is done by thresholding these distributions. There are two basic paradigms for modeling densities: functional and empirical. A functional model is constrained to achieving a particular shape, and need only learn the parameters. For example, a distribution model might be constrained to a normal distribution and need only learn the mean and variance. An empirical distribution uses actual data to construct a distribution and therefore generally requires much more data in order to reduce variance due to limited sample size.

[0013] A prior art class of models called kernel-based models [Theodoridis, S., Koutroumbas, K., “Pattern Recognition”, Academic Press, pp. 41-44, 1999], has densities in the one-dimensional case that take the form $f(x) = \sum_{i} w_i\, g(x - x_i)$

[0014] under the condition that

∫f(x)dx=1

[0015] where g(x) is the kernel distribution, $x_i$ is the location of atom i, and $w_i$ is the weight of atom i. A commonly used kernel distribution is a Gaussian distribution. When g(x) is a Gaussian distribution, $\sum_{i} w_i = 1.$

[0016] An example of a one-dimensional kernel-based model for 3 atoms is shown in FIG. 3. In the example, there are three atoms 300, 302, 304 at locations 0.2, 0.5, and 0.7 with weights of 0.7, 0.4, and 0.8 respectively. The individual component density for atom 1, 301, atom 2, 303, and atom 3, 305, are weighted and summed to construct the overall model density 306. A simple way to produce a model of this type from a set of sampled empirical data is to set all the weights the same and use the samples themselves as atoms. The variance of the underlying Gaussian distributions can be estimated as the variance in the samples themselves.
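
The simple construction described above (equal weights, samples used as atoms, kernel variance taken from the samples) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the three sample values merely reuse the atom locations of FIG. 3, and the equal weights are normalized to sum to 1 so that the density integrates to 1.

```python
import math

def gaussian_kernel(u, sigma):
    """Gaussian density with zero mean and standard deviation sigma."""
    return math.exp(-0.5 * (u / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kernel_model_from_samples(samples):
    """Use each sample as an atom with equal weight; estimate sigma from the samples."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / (n - 1)  # sample variance
    weights = [1.0 / n] * n                                     # equal weights summing to 1
    return list(samples), weights, math.sqrt(variance)

def density(x, atoms, weights, sigma):
    """Evaluate f(x) = sum_i w_i * g(x - x_i)."""
    return sum(w * gaussian_kernel(x - a, sigma) for a, w in zip(atoms, weights))

samples = [0.2, 0.5, 0.7]  # illustrative values only
atoms, weights, sigma = kernel_model_from_samples(samples)

# Riemann-sum check over [-3, 4) that the model integrates to roughly 1.
step = 0.001
area = sum(density(-3.0 + i * step, atoms, weights, sigma) * step for i in range(7000))
```

With so few samples the estimated kernel width is large relative to the spacing of the atoms, which is the small-sample variance problem that the startup learning constraints are meant to address.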

OBJECTS AND ADVANTAGES

[0017] The invention seeks to provide continuing system improvement through online learning. A further objective is to automatically qualify data for learning use. A still further objective is to provide an indication of learning maturity to guide users in effective management of the learning system. A further objective is to minimize error in preferred regions of a probability density function model by emphasizing the regions of importance in the data space before clustering.

SUMMARY OF THE INVENTION

[0018] An online learning method for a decision system is described that uses qualified objects to train the system during operation. A measure of learning maturity is disclosed that exhibits the system's learned condition in light of the data that it is processing. Qualification of data for learning is done by using different information than is used for the decision classifier, or by using different thresholds than are used for the decision output. Methods for clustering data together enhance the distribution of data used to learn.

BRIEF DESCRIPTION OF THE DRAWINGS

[0019] The preferred embodiments and other aspects of the invention will become apparent from the following detailed description of the invention when read in conjunction with the accompanying drawings which are provided for the purpose of describing embodiments of the invention and not for limiting same, in which:

[0020] FIG. 1 shows a block diagram of a decision system processing flow, including startup and online learning processes;

[0021] FIG. 2 shows an example diagram portraying densities of healthy and unhealthy patients as a function of systolic blood pressure;

[0022] FIG. 3 shows a simple kernel-based density model with three weighted atoms and a Gaussian kernel;

[0023] FIG. 4 shows weighted atoms that could be used to enable a kernel-based model to achieve a piecewise-uniform functional shape;

[0024] FIG. 5 shows a block diagram of the process flow of online learning;

[0025] FIG. 6A shows a defect detection classifier used to separate objects into very likely defects or other;

[0026] FIG. 6B shows a non-defect qualifier used to separate objects into very likely non-defects or other;

[0027] FIG. 7A shows one object qualification method in detection score space;

[0028] FIG. 7B shows a different object qualification method that uses another feature space to perform the qualification;

[0029] FIG. 8 shows an example of feature scores for atoms being considered for clustering;

[0030] FIG. 9 shows a warping of clustering space that enables particular regions to be identified as regions where maintenance of exact atom positions is more desirable (or regions where it is unimportant).

DETAILED DESCRIPTION OF THE INVENTION

[0031] A kernel-based model has the flexibility to accurately represent a wide variety of distributions but requires a large amount of data in order to do so without a damaging amount of variance. If the sample set is too small, one model can vary widely from another model with both having the same number of samples (but different samples) from the same distribution. One embodiment of the invention combines the advantages of kernel-based and empirical modeling by choosing as atoms for the initial learning set a functionally derived set of atoms that constrains the kernel-based model and allows a limited number of shapes of the model distribution thereby assuring model stability even with a very small number of learning samples.

[0032] An example is shown in FIG. 4 for a one-dimensional feature that has a model of the feature score distribution that is a step function. The feature score model is $f(x) = \sum_{i=0}^{I_0 - 1} w\, g\left(x - \frac{i}{I}\right) + \sum_{i=I_0}^{I-1} r\,w\, g\left(x - \frac{i}{I}\right)$ where $w = \frac{1}{I_0 + r\left(I - I_0\right)}$

[0033] for a model with I atoms, a transition from high density to low density at $x_0 = \frac{I_0}{I}$,

[0034] and a desired ratio of density r. Given these 2 constraints, all learning needs to do is extract the value x₀ 400 at which the transition occurs and the value r which indicates the degree of transition. In one embodiment of the invention, when the number of samples used for modeling is small, the maximally scored sample is used to determine the value x₀.
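
The constrained step-shaped model can be sketched as below, as a minimal illustration only. The atom locations i/I, the weight w, and the ratio r follow the formula given above; the function names, the choice of 20 atoms, the value r = 0.1, and the sample scores are hypothetical and are assumed to lie in [0, 1].

```python
def step_model_atoms(num_atoms, x0, r):
    """Return (locations, weights) for the constrained step-shaped model.

    Atoms sit at i / I for i = 0..I-1. Atoms below the transition index I0
    get weight w, the rest get r * w, with w = 1 / (I0 + r * (I - I0)).
    """
    i0 = round(x0 * num_atoms)                 # transition index, from x0 = I0 / I
    w = 1.0 / (i0 + r * (num_atoms - i0))
    locations = [i / num_atoms for i in range(num_atoms)]
    weights = [w if i < i0 else r * w for i in range(num_atoms)]
    return locations, weights

def learn_x0(scores):
    """Small-sample rule from the text: take the maximally scored sample as x0."""
    return max(scores)

scores = [0.12, 0.31, 0.05, 0.40, 0.22]        # illustrative startup scores
x0 = learn_x0(scores)
locations, weights = step_model_atoms(num_atoms=20, x0=x0, r=0.1)
```

Because the normalization w is fixed by I, I₀ and r, the weights sum to 1 for any valid choice of the two learned values, so startup learning only has to estimate x₀ and r.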

[0035] This approach is advantageous over approaches that directly use a functional model (such as a simple piecewise uniform model) because it can model the tail of the distribution better. The example functional/empirical approach of the invention provides flexibility in successive stages of learning. An empirical feature distribution function could be determined and used to calculate the atom locations and weights. When sufficient learning data has been acquired, a switch from a functional to an empirical model can be made without changing the classification system at all. This example is not intended to constrain the invention, which is applicable multi-dimensionally and with different modeling functions.

[0036] I. Online Learning

[0037] The processing flow of one embodiment of the online learning method of the invention is shown in FIG. 5. Objects are collected during operation of the decision system 500 where it is being applied to real use. The collected objects could include normal or defective samples. The online learning method of the invention 508 updates the learning model. No human is involved in the online learning process to review the collected objects. In one embodiment of the invention, non-defect models are updated. To assure that defective objects are not included in the non-defect model updating process, an online learning qualifier 502 is applied to the collected objects 500 to qualify them for model update 504 or disqualify them for model update 506. Only qualified objects are used to update the learning model. Those skilled in the art should recognize that the model update could be for a defect model in which case the qualifier will qualify defect rather than non-defect samples for online learning.
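
The flow of FIG. 5 (collect, qualify, update with qualified objects only) might be sketched as follows. The function name, the score threshold used as a stand-in qualifier, and the list used as a stand-in model are all hypothetical; the source does not prescribe an interface.

```python
def online_learning_step(collected_objects, qualifier, update_model):
    """Qualify collected objects and update the model with the qualified ones only."""
    qualified = [obj for obj in collected_objects if qualifier(obj)]
    disqualified = [obj for obj in collected_objects if not qualifier(obj)]
    if qualified:
        update_model(qualified)
    return qualified, disqualified

# Illustrative stand-ins (assumptions, not from the source).
model_atoms = []                                # stand-in for the non-defect learning model
collected = [0.10, 0.25, 0.91, 0.33, 0.87]      # detection scores of collected objects
qualified, disqualified = online_learning_step(
    collected,
    qualifier=lambda score: score < 0.5,        # hypothetical non-defect qualifier
    update_model=model_atoms.extend,
)
```

Keeping the qualifier as a separate callable reflects the point made in the text that qualification is a decision distinct from detection and can be swapped without touching the model update.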

[0038] II. Online Learning Qualifier

[0039] During normal processing of input subjects for decision making, the system has access to large amounts of raw data. However, without an expert labeler the system does not know a priori which data correspond to objects of interest (such as defects) and which correspond to normal objects (such as non-defects). One of the primary functions of this invention is to improve the model of the normal (non-anomalous) data distribution. However, this must be done without corrupting the model by incorporating data that are actually from anomalous subjects on which the system would like to alarm. If this happens in sufficient amounts, the system will come to regard such data as normal and lose its ability to classify defects. To avoid this, a number of methods are taught for qualifying data and updating the non-defect model as different embodiments of the invention.

[0040] The distinction between qualification and detection is subtle because on a gross level, both are dividing objects into defects and non-defects. The decision strategies are different between a defect detection classifier and the non-defect qualifier. In an embodiment, the purpose of detection is to alarm only on objects that are very likely to be defects in order to minimize false alarm rate. Thus, for a defect detection classifier, the objects that are not alarmed could be normal samples or missed defects. Therefore, the not-alarmed objects cannot be safely assumed to be non-defect and used for non-defect model online learning. The purpose of the qualifier, on the other hand, is to identify only objects that are very likely to be non-defects, as they will be used to improve the non-defect model during online learning. Therefore, the strategy is to minimize defect objects that are qualified.

[0041] The difference between prior art classification and qualification of this invention can be further illustrated by use of three related perspectives:

[0042] 1. Specifics of the categories into which objects will be divided (and purpose of division),

[0043] 2. Score space over which the categorization boundary will be created,

[0044] 3. Source of information on which to base the categorization boundary.

[0045] In one embodiment of the invention, a qualifier is merely a classifier with different thresholds. The difference between the non-defect qualifier (FIG. 6B) and the defect detection classifier (FIG. 6A) may be simply viewed as a difference in thresholds on the same confidence measure. In FIG. 6A the threshold 604 is set to detect defects that lie in region 608. In FIG. 6B the threshold 606 is set to qualify non-defects that lie in region 610.

[0046] In another embodiment of this invention, a method similar to the above example is shown in FIG. 7A. FIG. 7A shows a hypothetical background probability density distribution 701 for normal objects and a defect probability density distribution 703 plotted for a score that the online learning system will attempt to model. Hypothetical inputs that are received are shown as arrows in the diagram distributed according to their score. A parameter that is an estimate of the fraction of images likely to be defects is used to select a fraction of the highest-scoring objects (above threshold 700) to remove from the online learning set (or disqualify them) before using the remainder to update the non-defect model. Objects 704, 705, 706, 707, 708, 709, 710 are accepted and 711, 712 are rejected for online learning.
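
A minimal sketch of this fraction-based qualifier is given below. The function name, the scores, and the defect fraction of 0.2 are hypothetical; rounding the number of rejected objects up is an assumption chosen to err on the side of disqualifying, and the source does not specify a rounding rule. The nine scores are arranged so that seven objects are accepted and two rejected, mirroring the counts in FIG. 7A.

```python
import math

def qualify_by_fraction(scores, defect_fraction):
    """Drop the highest-scoring fraction of objects; return (qualified, disqualified)."""
    n_reject = math.ceil(defect_fraction * len(scores))  # assumption: round up
    ranked = sorted(scores)
    cut = len(ranked) - n_reject
    return ranked[:cut], ranked[cut:]

scores = [0.05, 0.12, 0.18, 0.22, 0.30, 0.35, 0.41, 0.78, 0.93]  # illustrative scores
qualified, disqualified = qualify_by_fraction(scores, defect_fraction=0.2)
```

Note that this rule removes the top of the score range outright, which is the hard cutoff on the tail that the different-information qualifier of FIG. 7B is intended to avoid.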

[0047] A third embodiment of the qualifier could use different information (i.e. different features) than the defect detector did in the decision-making process. FIG. 7B shows a different distribution and a different threshold 720. On the basis of this different information, different non-defect objects are qualified as shown in FIG. 7B. Arrows 704, 705, 706, 707, 708, 709, 711 indicate accepted non-defect objects. Using the different information qualifier that operates in a different feature space and has a different threshold 720, we could accept some of the non-defect objects (such as 711). Even though the FIG. 7B qualification shows 710 is not qualified, the qualified objects are a better representation of the overall distribution of data because there is no longer a hard cutoff that loses the tail information in the decision feature space. Note that the difference in feature space changes the probability density distributions for normal objects 701 and defect objects 703 in the two different feature spaces (and also that the objects themselves lie in different positions). Yet objects are recovered for learning that lie within the tail of the distribution of the score (the distribution used for defect detection in this example).

[0048] To preserve the independence of the qualifier from the main decision making classifier, a different mechanism could be used for qualification than the cut-off approach illustrated in FIG. 7B. In a preferred embodiment of the invention, a regulation tree classifier [disclosed in U.S. patent application Ser. No. 09/972,057 on Oct. 5, 2001 by Shih-Jong J. Lee entitled, “Method for Regulation of Hierarchic Decisions in Intelligent Systems” which is incorporated in its entirety herein] is used for the online learning qualifier. Additional rules for qualifying or disqualifying objects, applied alongside the existing qualifier, could be based on other heuristic algorithms. In one embodiment of the invention, objects with confidence scores in the lowest Nth percentile of the collected data are considered to be so likely to be non-defects that they are automatically qualified. In another embodiment of the invention, objects which are in the highest percentile and which exceed a certain critical hard coded (i.e. fixed, not floating) score are considered so likely to be defects that they are automatically disqualified regardless of what the outcome of the tree mentioned above is. Those skilled in the art should recognize that the model update could be for a defect model in which case the qualifier will qualify defect rather than non-defect samples for online learning.
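
The two percentile heuristics can be sketched as a pre-filter that labels each object before the main qualifier is consulted. This is an illustration only: the function name, the label strings, the percentile values, and the hard score of 0.95 are hypothetical, and the main qualifier (for example the regulation tree) is not modeled here.

```python
def heuristic_labels(scores, low_pct, high_pct, hard_score):
    """Return one label per score: 'qualified', 'disqualified', or 'defer'.

    'defer' means the decision is left to the main qualifier.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # indices from lowest to highest score
    n_low = int(n * low_pct / 100.0)                   # size of the lowest-percentile group
    n_high = n - int(n * high_pct / 100.0)             # size of the highest-percentile group
    labels = ["defer"] * n
    for i in order[:n_low]:
        labels[i] = "qualified"                        # automatically qualified
    for i in order[n - n_high:]:
        if scores[i] > hard_score:                     # must also exceed the fixed score
            labels[i] = "disqualified"
    return labels

scores = [0.40, 0.05, 0.97, 0.10, 0.55, 0.62, 0.91, 0.30, 0.20, 0.70]  # illustrative
labels = heuristic_labels(scores, low_pct=20, high_pct=80, hard_score=0.95)
```

In this example the object scoring 0.91 falls in the highest percentile group but does not exceed the fixed score, so it is deferred rather than disqualified, which illustrates why the text requires both conditions.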

[0049] III.2 Learning Model Update

[0050] Once we have obtained at least one qualified object 504 from the online data collection process 500, the next step is to update the decision system learning model 508 to reflect the newly gathered information. If the number of data points is small enough, an embodiment could simply add them each into the kernel-based model as atoms with a weight of 1.0. However, as the number of data points gets large this can become prohibitively expensive in terms of computation time and/or storage. Another embodiment of the invention performs clustering (grouping) on the atoms to represent them using a smaller number of atoms corresponding to cluster centers. Many clustering algorithms exist [Keinosuke Fukunaga, “Introduction to Statistical Pattern Recognition”, Academic Press, 1990, pp 508-531]. The goal of the preferred embodiment is to take advantage of the knowledge of regions of the feature space that are most critical to be accurately represented to optimize the performance of the system.

[0051] Many clustering methods rely on combining individual values in such a way as to attempt to minimize an error function that may be a function of:

[0052] 1. the distance between the original point location and the point into which it is clustered;

[0053] 2. the weight of the point which is being clustered.

[0054] For example, an atom with weight $w_i$ and location $x_i$ that is being considered for clustering to a new atom at a location x′ might have a contribution to the error produced by that clustering of something as simple as:

$e_i = |x_i - x'|\,w_i$

[0055] The disadvantage of this function from the point of view of a decision system is that it gives equal weight to clusters in all regions of the space. If a learning system has a goal of establishing a threshold which corresponds to alarming on objects which are in the highest 5% of the population, for example, it is crucial that the learning system represents the distribution as accurately as possible in this range of the feature space of interest. In the preferred embodiment of the invention, the learning system allows optimization at a range of alarm rates. This specifies a region of feature space over which the system needs to maximize CDF resolution. Take the example in FIG. 8; the two atoms 802, 803 at 0.8 and 0.84 (pair B) and 800, 801 at 0.1 and 0.2 (pair A) are candidates for clustering together. In this example, the CDF operating range for this system is between 0.9 and 0.99 (to achieve alarm rates between 10% and 1%, respectively). Thus a large shift in the density model in the region between 0.1 and 0.2 will have little effect on the estimates of locations producing CDF values between 0.9 and 0.99, but a small shift in the density model at scores between 0.8 and 0.84 could have a large effect. So it will be appropriate to cluster pair A rather than pair B.

[0056] The preferred embodiment of this invention reflects this information by performing a warping of the score space in such a way that distances in the less interesting range are decreased and distances in the range of interest are magnified. A piecewise linear version of this warping function is illustrated in FIG. 9. In FIG. 9 sample 800 is warped to 900, sample 801 is warped to 901, sample 802 is warped to 902, and 803 is warped to 903 before clustering is performed on the warped score.

[0057] After the warping, the distance between new pair A 900, 901 is 0.05, whereas the distance between pair B 902, 903 is 0.08 in the warped score. This conveys to the clustering algorithm the information that it is more desirable to cluster 900, 901 than 902, 903. By choosing the inflection points (e.g. 904) to put greater slopes above the regions an application wishes to best model, it can minimize error in selected regions of the CDF.

[0058] In the preferred embodiment of the invention, the warping function is described as follows. The warping function is a piecewise linear function described by a series of points

$v_0, v_1, \ldots, v_n, \ldots, v_N$

[0059] such that

$v_{n-1} < v_n$ for all $n \in \{1, 2, \ldots, N\}$.

[0060] and a set of slopes

$m_1, m_2, \ldots, m_n, \ldots, m_N$

[0061] such that

$m_n > 0$ for all $n \in \{1, 2, \ldots, N\}$.

[0062] Define

$B = \max\{n \mid x > v_n\} + 1$

[0063] Then the warped location of an input atom x is described as: $w(x) = \sum_{n=1}^{B-1} m_n\left(v_n - v_{n-1}\right) + m_B\left(x - v_{B-1}\right)$

[0064] This function is defined over the region

[0065] $[v_0, v_N]$.

[0066] Choosing a warping function is a matter of selecting regions $[v_{n-1}, v_n]$ to emphasize by setting $m_n$ to a relatively high value, or conversely to de-emphasize by setting $m_n$ to a relatively low value. The regions with the most emphasis will be least likely to have clustering performed, and thus will most retain the accuracy of representation of the distribution through the clustering process.
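
The piecewise linear warping defined above can be sketched as follows. The function name and argument layout are hypothetical. The breakpoint at 0.7 and the slopes 0.5 and 2.0 are assumed values, chosen only because they reproduce the warped distances of 0.05 for pair A and 0.08 for pair B quoted for FIG. 9; the actual inflection points of the figure are not stated in the text.

```python
def warp(x, v, m):
    """Piecewise linear warping of score x.

    v = [v_0, ..., v_N] are increasing breakpoints; m = [m_1, ..., m_N] are the
    positive slopes, with m[n - 1] applying on the segment [v_{n-1}, v_n].
    """
    if not v[0] <= x <= v[-1]:
        raise ValueError("x lies outside [v_0, v_N]")
    warped = 0.0
    for n in range(1, len(v)):
        if x > v[n]:
            warped += m[n - 1] * (v[n] - v[n - 1])   # whole segment lies below x
        else:
            warped += m[n - 1] * (x - v[n - 1])      # partial segment containing x
            break
    return warped

v = [0.0, 0.7, 1.0]   # assumed breakpoints
m = [0.5, 2.0]        # assumed slopes: compress low scores, magnify high scores

pair_a = abs(warp(0.2, v, m) - warp(0.1, v, m))    # atoms 800, 801
pair_b = abs(warp(0.84, v, m) - warp(0.8, v, m))   # atoms 802, 803
```

Before warping, pair B is the closer pair (0.04 versus 0.1), so a distance-based clustering step would merge it first; after warping, pair A is the closer pair, which steers clustering away from the operating range of the CDF.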

[0067] Those skilled in the art should recognize that other linear or nonlinear functions could be used as the warping function.

[0068] Maturity Estimation

[0069] When a system can learn not just once but at various points in its use, an important function is its ability to estimate the degree to which it has learned and the additional learning which is useful. In this invention, these concepts are expressed through an output referred to as the learning maturity score. This value is updated with each new input that is provided to the system, and provides users with a continuously updated view of the system's self-evaluation of performance. If successive learning sessions fail to increase this number significantly, then further learning is not necessary. If the trend of the learning maturity score drops off, it is an indicator that the system has started to encounter new data characteristics and it needs additional learning.

[0070] The underlying concept of the learning maturity score is to estimate how well the probability density distribution of feature scores for detection subjects (i.e. operational distribution) matches the distributions obtained during system learning for the same feature scores. One method of estimating learning maturity is to accumulate statistical qualities of each feature separately. The feature score distribution obtained during system learning is expressed as a Cumulative Distribution Function (CDF, $C_f(x)$), having input feature score values x and an output value of 0.5 for the average value of the feature score. During operation of the system after learning, the feature scores are mapped using the learned CDF function to CDF output values. The average of the CDF values so obtained is a normalized expression of their average probability density compared to the 0.5 CDF value that was obtained during learning. If the average value obtained during operation differs from 0.5, the amount of difference is a measure of dissimilarity of the operating distribution (for a particular feature score) from the original learned distribution. This concept is utilized in the preferred embodiment of the invention for maturity measurement.

[0071] Consider a sequence of feature representations of incoming subjects $x_i(k)$, where i is the index on the subject and $k \in \{0, 1, \ldots, K-1\}$ is the index on the feature. Calculate the appropriate CDF value

$c(k) = C_f(x_i(k))$

[0072] Then calculate a representation of the average of this value across the range of past subjects. For example, we could maintain a windowed average of recent data, with the mean CDF value defined over each range of i as:

$\bar{c}_i(k) = \begin{cases} c(k), & i = 0 \\ \frac{i}{i+1}\,\bar{c}_{i-1}(k) + \frac{1}{i+1}\,c(k), & 0 < i < N \\ \frac{N}{N+1}\,\bar{c}_{i-1}(k) + \frac{1}{N+1}\,c(k), & i \geq N \end{cases}$

[0073] Then, to calculate the maturity estimate for that feature, calculate:

$m'_i(k) = 1 - 2\left|\bar{c}_i(k) - 0.5\right|$

[0074] $m'_i(k)$ has its maximum (i.e. 1.0) when the mean CDF value is 0.5, and decreases the further the mean CDF value gets from 0.5. Finally, combine the estimates from the various scores: $m_i = \frac{1}{K}\sum_{k=0}^{K-1} m'_i(k).$
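
The maturity calculation can be sketched as a small running estimator. This is a hedged illustration: the class name and interface are hypothetical, the learned CDFs are assumed to be supplied as one callable per feature, and the uniform CDF on [0, 1] is a stand-in for whatever distribution was actually learned. The update rule and the score follow the formulas given above.

```python
class MaturityEstimator:
    """Tracks a windowed mean of learned-CDF values per feature and the maturity score."""

    def __init__(self, learned_cdfs, window):
        self.learned_cdfs = learned_cdfs          # one CDF callable per feature k
        self.window = window                      # N in the update rule
        self.count = 0                            # i, the number of subjects seen so far
        self.mean_cdf = [0.0] * len(learned_cdfs)

    def update(self, features):
        """Fold in one subject's feature scores and return the maturity score m_i."""
        n = min(self.count, self.window)          # equals i while i < N, then N
        for k, (cdf, x) in enumerate(zip(self.learned_cdfs, features)):
            c = cdf(x)
            self.mean_cdf[k] = (n * self.mean_cdf[k] + c) / (n + 1)
        self.count += 1
        per_feature = [1.0 - 2.0 * abs(c_bar - 0.5) for c_bar in self.mean_cdf]
        return sum(per_feature) / len(per_feature)

def uniform_cdf(x):
    """Stand-in learned CDF: uniform distribution on [0, 1]."""
    return min(1.0, max(0.0, x))

estimator = MaturityEstimator([uniform_cdf, uniform_cdf], window=10)
first = estimator.update([0.25, 0.5])    # mean CDFs become 0.25 and 0.5
second = estimator.update([0.75, 0.5])   # mean CDFs become 0.5 and 0.5
third = estimator.update([1.0, 1.0])     # operational data drifting high
```

The score rises to 1.0 once the running mean CDF reaches 0.5 for both features and falls again when the third subject pulls the means upward, which is the drop-off behavior the text describes as a signal that additional learning is needed.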

[0075] Those skilled in the art should recognize that other means of comparing two distributions could also be used to estimate a maturity score. For example, the χ² test or the Kolmogorov-Smirnov test [Press, W., Flannery, B., Teukolsky, S., Vetterling, W., “Numerical Recipes in C”, Cambridge University Press, 1988, pp. 487-494] can be used to compare two distributions. Generally, the more alike the two distributions are, the more mature the learning.
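
As one illustration of such an alternative comparison, the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) can be computed with the standard library alone. The function name and the sample values are hypothetical, and the text does not specify how the statistic would be mapped to a maturity score; only the statistic itself is sketched here.

```python
import bisect

def ks_statistic(learned, operational):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between empirical CDFs."""
    a, b = sorted(learned), sorted(operational)
    gap = 0.0
    for x in a + b:                                   # the maximum occurs at a data point
        cdf_a = bisect.bisect_right(a, x) / len(a)    # fraction of a that is <= x
        cdf_b = bisect.bisect_right(b, x) / len(b)    # fraction of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

learned = [0.1, 0.2, 0.3, 0.4]                        # illustrative learning-time scores
same = ks_statistic(learned, [0.1, 0.2, 0.3, 0.4])
partial = ks_statistic(learned, [0.3, 0.4, 0.5, 0.6])
disjoint = ks_statistic(learned, [0.5, 0.6, 0.7, 0.8])
```

The statistic is 0 for identical samples and grows toward 1 as the operational scores move away from the learned ones, so a smaller value corresponds to more mature learning in the sense of the paragraph above.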

[0076] The invention has been described herein in considerable detail in order to comply with the Patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the inventions can be carried out by specifically different distributions and feature combination scores, and that various modifications, both as to the details and procedures, can be accomplished without departing from the scope of the invention itself.

What is claimed is:
 1. An online learning method for a decision system having a learning model comprises the following steps: (a) Collecting a plurality of objects during operation of the decision system; (b) Applying an online learning qualifier to the collected plurality of objects to obtain at least one qualified object; (c) Updating the decision system using the at least one qualified object.
 2. The method of claim 1 wherein the decision system contains a learning model and update of the decision system updates the learning model.
 3. The method of claim 1 wherein the online learning qualifier selects a fraction of highest-scoring objects to disqualify.
 4. The method of claim 1 wherein the online learning qualifier uses different information than the decision-making process.
 5. The method of claim 4 wherein the online learning qualifier uses a regulation tree classifier.
 6. The method of claim 1 wherein the decision system update adds the at least one qualified object into a kernel-based model.
 7. The method of claim 1 wherein the decision system update includes a clustering method.
 8. The method of claim 7 wherein the clustering method performs a warping function.
 9. The method of claim 8 wherein the warping function is a piecewise linear function.
 10. A maturity estimation method for a decision system having a learning model distribution comprises the following steps: (a) Collecting a plurality of objects during operation to form an operational model probability density distribution; (b) Comparing the learning model probability density distribution to the operational model probability density distribution to generate a measure of dissimilarity; (c) Outputting a learning maturity estimate from the measure of dissimilarity.
 11. The method of claim 10 wherein the measure of dissimilarity is a mean CDF value.
 12. The method of claim 10 wherein the measure of dissimilarity is derived from the χ² test.
 13. The method of claim 10 wherein the measure of dissimilarity is derived from the Kolmogorov-Smirnov test.
 14. The method of claim 10 wherein the maturity estimate can be calculated by the following rule: MaturityEstimate = 1 − 2|measure_of_dissimilarity − 0.5|
 15. A learning method for a decision system comprises the following steps: (a) inputting a plurality of expert labeled data; (b) performing startup learning using the expert labeled data to create a learning model; (c) performing online learning to update the learning model.
 16. The method of claim 15 wherein the learning model is a kernel-based model.
 17. The method of claim 15 wherein startup learning combines a functional model and an empirical model.
 18. The method of claim 15 wherein the online learning method further comprises the following steps: (a) collect a plurality of objects during operation of the decision system; (b) apply an online learning qualifier to the collected plurality of objects to generate at least one qualified object; (c) update the learning model using the at least one qualified object.
 19. The method of claim 18 wherein the online learning qualifier selects a fraction of highest-scoring objects to disqualify.
 20. The method of claim 18 wherein the online learning qualifier uses different information in the decision-making process.
 21. The method of claim 18 wherein the learning model update adds the at least one qualified object into a kernel-based model.
 22. The method of claim 18 wherein the learning model update includes a clustering method.
 23. The method of claim 22 wherein the clustering method performs a warping function.
 24. The method of claim 23 wherein the warping function is a piecewise linear function.